This project is part of the “Explore and Summarize Data” module from Udacity’s Data Scientist Nanodegree Program.
To develop this project, the chosen data set was Red Wine Quality, which is publicly available for research. More details are described in:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
The exploratory analysis will be guided by the following question: Which chemical properties influence the quality of red wines?
The dataset variables and their types are:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
We can observe that there are discrete and continuous variables. The X variable is just an index for each observation in the dataset, so let’s remove it.
red_wine <- within(red_wine, rm(X))
Let’s see the distribution of our variables in the dataset.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The variables fixed.acidity, volatile.acidity, citric.acid, residual.sugar, free.sulfur.dioxide, and total.sulfur.dioxide show high dispersion (their maxima lie far above their third quartiles), which may indicate the existence of outliers.
Regarding wine quality, ratings range from 3 to 8, with a median of 6.
As a first step in the univariate analysis, let’s plot some histograms to understand the structure of the individual variables in the dataset.
The density and pH plots show approximately normal distributions, while citric.acid, free.sulfur.dioxide, and total.sulfur.dioxide show right-skewed distributions. Outliers can be observed mainly in the residual.sugar and chlorides plots.
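A quick numeric check of the skew described here is to compare each mean with its median, since a mean noticeably above the median hints at a right tail. The sketch below only reuses the means and medians printed in the summary output; the names `centers` and `rel_gap` are illustrative and not part of the original analysis.

```r
# Means and medians copied from the summary() output above.
centers <- data.frame(
  variable = c("citric.acid", "free.sulfur.dioxide", "total.sulfur.dioxide",
               "density", "pH"),
  mean     = c(0.271, 15.87, 46.47, 0.9967, 3.311),
  median   = c(0.260, 14.00, 38.00, 0.9968, 3.310)
)

# Relative gap between mean and median; clearly positive values
# suggest a right-skewed distribution, values near zero a symmetric one.
centers$rel_gap <- (centers$mean - centers$median) / centers$median
print(centers)
```

For density and pH the gap is close to zero, while for the three skewed variables the mean sits visibly above the median. This is only a rough heuristic and does not replace looking at the histograms.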
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Given the summary and the plot of the quality feature, we can observe that most of the observations are classified as 5 or 6, close to the median of 6. Few wines were rated 3 or 4 (low quality) or 7 or 8 (high quality). Based on that, the data was grouped into 3 categories: low (quality < 5), average (5 ≤ quality < 7), and high (quality ≥ 7), as shown in the plot below.
## low average high
## 63 1319 217
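One way such a grouping could be built is with `cut()`. The sketch below assumes that scores 3 and 4 map to low, 5 and 6 to average, and 7 and 8 to high, which matches the description of wines rated 7 and 8 as high quality. The helper name `to_rating` is hypothetical, and the ten quality scores are the ones listed for the first wines in the `str()` output, so the counts differ from the full table above.

```r
# Quality scores of the first ten wines, as listed in the str() output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Hypothetical helper with assumed cut points:
# (-Inf, 4] -> low, (4, 6] -> average, (6, Inf) -> high.
to_rating <- function(q) {
  cut(q, breaks = c(-Inf, 4, 6, Inf), labels = c("low", "average", "high"))
}

rating <- to_rating(quality)
print(table(rating))

# Mapping for every score that occurs in the dataset (3 to 8).
print(data.frame(quality = 3:8, rating = to_rating(3:8)))
```

Because `cut()` uses right-closed intervals by default, a score of 4 falls in the low bin and a score of 6 in the average bin.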
The dataset contains 1,599 observations of different red wines. It is composed of 12 features: 11 chemical properties, which were considered in the analysis, and the score given by the experts, named quality.
The main feature in the dataset is quality since it represents the experts’ opinion about the wines.
I think volatile.acidity, citric.acid, total.sulfur.dioxide, pH, and the percent alcohol of the wine are the features that can support the investigation, since they contribute most to the smell and taste of wine.
Yes, I created the ‘rating’ variable, which is a categorical representation of wine quality: low (quality < 5), average (5 ≤ quality < 7), and high (quality ≥ 7).
I have removed the X variable, which represented the dataset index.
In this section, we are going to explore the following features: volatile.acidity, citric.acid, total.sulfur.dioxide, pH, and alcohol.
## red_wine$rating: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5650 0.6800 0.7242 0.8825 1.5800
## --------------------------------------------------------
## red_wine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5400 0.5386 0.6400 1.3300
## --------------------------------------------------------
## red_wine$rating: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4055 0.4900 0.9150
## red_wine$rating: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0200 0.0800 0.1737 0.2700 1.0000
## --------------------------------------------------------
## red_wine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2400 0.2583 0.4000 0.7900
## --------------------------------------------------------
## red_wine$rating: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3000 0.4000 0.3765 0.4900 0.7600
## red_wine$rating: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 13.50 26.00 34.44 48.00 119.00
## --------------------------------------------------------
## red_wine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 24.00 40.00 48.95 65.00 165.00
## --------------------------------------------------------
## red_wine$rating: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.00 27.00 34.89 43.00 289.00
## red_wine$rating: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.380 3.384 3.500 3.900
## --------------------------------------------------------
## red_wine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.210 3.310 3.311 3.400 4.010
## --------------------------------------------------------
## red_wine$rating: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.270 3.289 3.380 3.780
## red_wine$rating: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
## --------------------------------------------------------
## red_wine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## red_wine$rating: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
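Per-group summaries like the ones above can be produced with `by()`, and group medians alone with `tapply()`. The sketch below is an illustration on the first ten rows shown in the `str()` output, not on the full dataset, so its numbers do not match the output above. The name `wine_head` and the rating cut points (3 and 4 low, 5 and 6 average, 7 and 8 high) are assumptions.

```r
# First ten rows of three columns, copied from the str() output.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumed rating bins: 3-4 low, 5-6 average, 7-8 high.
wine_head$rating <- cut(wine_head$quality, breaks = c(-Inf, 4, 6, Inf),
                        labels = c("low", "average", "high"))
# None of these ten wines is rated low, so drop the empty level.
wine_head$rating <- droplevels(wine_head$rating)

# Summary of one feature within each rating group.
print(by(wine_head$alcohol, wine_head$rating, summary))

# Group medians only, which is what the boxplots compare.
medians <- tapply(wine_head$volatile.acidity, wine_head$rating, median)
print(medians)
```

With only ten wines the groups are too small to show the trends discussed in this section; the point is just the shape of the calls.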
Based on the boxplots, the relationship between pH and citric.acid is clear. As pH decreases, the wine becomes more acidic and the citric.acid value increases. Wines rated ‘high’ have the lowest median pH (3.270) and the highest median citric.acid (0.4000).
The plot below shows a negative correlation of -0.5419 between pH and citric.acid features.
## [1] -0.5419041
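A single number in this form is what `cor()` returns for two numeric vectors. As a minimal sketch, the code below computes the Pearson correlation on the first ten pH and citric.acid values from the `str()` output and checks it against the definition (covariance divided by the product of the standard deviations). Since it uses ten wines rather than all 1,599, the result will differ from the value reported above.

```r
# First ten values of each feature, copied from the str() output.
pH          <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Pearson correlation (the default method of cor()).
r <- cor(pH, citric.acid)

# Same quantity from its definition: covariance over the product of SDs.
r_manual <- cov(pH, citric.acid) / (sd(pH) * sd(citric.acid))

print(c(cor = r, manual = r_manual))
```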
Alcohol and citric.acid both play important roles in high quality wines; however, there is no particularly striking relationship between the two features (a weak positive correlation of 0.1099), as presented below.
## [1] 0.1099032
Still trying to relate acidity to the alcohol level of the wine, we found a weak positive correlation of 0.2056 between the alcohol and pH features, as shown in the following plot.
## [1] 0.2056325
Using a different feature, the plot below shows the relationship between alcohol and density. They presented a negative correlation of -0.4961. In other words, the higher the alcohol level, the lower the density of wine.
## [1] -0.4961798
The feature volatile.acidity represents the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste. Accordingly, the boxplots show a strong relationship between this feature and the quality rating: wines with elevated volatile.acidity obtained a low quality rating, whereas wines with lower volatile.acidity obtained a high quality rating. For the wines rated high, the 3rd quartile (0.4900) is lower than the median (0.5400) of the wines rated average. In other words, the concentration of volatile.acidity is lower for the high quality wines.
The median values of citric.acid in the boxplots were 0.0800 for low, 0.2400 for average, and 0.4000 for high quality rating. This means that high quality wines present a higher concentration of citric.acid, the opposite of the trend observed in the volatile.acidity plots.
For the total.sulfur.dioxide feature, there was no clear relationship with quality, since the boxplots present very close medians for low and high quality wines (26.00 and 27.00, respectively).
The pH feature describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. According to the boxplots, the median pH is 3.380 for low quality wines, 3.310 for average, and 3.270 for high. In other words, more acidic wines tend to receive better ratings.
Regarding the percent alcohol content of the wine, the boxplots show that high quality wines have a higher median alcohol level (11.60) than low and average quality wines (10.00 for both).
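To see the five comparisons side by side, the group medians from the by-rating summaries can be collected in one small table and labeled by direction. This is only a sketch: the `trend` helper and its labels are hypothetical, and the numbers are the medians printed in the summaries above.

```r
# Group medians copied from the by-rating summaries above.
medians <- data.frame(
  feature = c("volatile.acidity", "citric.acid", "total.sulfur.dioxide",
              "pH", "alcohol"),
  low     = c(0.68, 0.08, 26, 3.38, 10.0),
  average = c(0.54, 0.24, 40, 3.31, 10.0),
  high    = c(0.37, 0.40, 27, 3.27, 11.6)
)

# Hypothetical helper: direction of the medians from low to high rating.
trend <- function(low, average, high) {
  d <- diff(c(low, average, high))
  if (all(d >= 0) && any(d > 0)) return("increasing")
  if (all(d <= 0) && any(d < 0)) return("decreasing")
  "no clear trend"
}

medians$trend <- mapply(trend, medians$low, medians$average, medians$high)
print(medians)
```

The medians of volatile.acidity and pH fall as the rating improves, those of citric.acid and alcohol rise (alcohol only between average and high), and total.sulfur.dioxide shows no consistent direction.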
Volatile.acidity and citric.acid presented a high negative correlation (-0.5524).
Although wines with higher alcohol content and higher acidity received the high quality classification, the relationship between these features is weak: the correlation is 0.2056 between alcohol and pH, and 0.1099 between alcohol and citric.acid.
Another interesting relationship was observed between alcohol and density. They presented a negative correlation of -0.4961. In other words, the higher the alcohol level, the lower the density of wine.
The strongest relationship was found between volatile.acidity and citric.acid, which presented a negative correlation of -0.5524.
Based on the results of the previous section, when comparing citric.acid and volatile.acidity, we observed that most of the high quality wines presented high citric.acid concentration and low volatile.acidity concentration. The reverse is true for wines that have obtained low quality rating.
The pH and alcohol features were also analyzed previously. In the plot below it is possible to see how higher pH values are associated with the low classification rating of red wines.
It seems that alcohol is an important characteristic for classification, so we compare this variable with others that may directly impact the high or low quality rating of a wine.
In the following plot, we observed that low quality wines have higher density and lower alcohol level.
For the alcohol and volatile.acidity features, it is clear that low volatile.acidity and high alcohol level are strongly associated with the classification of a wine as high quality.
Another important feature is citric.acid; however, when comparing it to alcohol, nothing striking stands out about how the two features together relate to low or high quality wines.
For the multivariate analysis six features were considered: alcohol, pH, volatile.acidity, citric.acid, density, and rating (categorical for quality).
When grouped together, the role of each of these chemical properties in the manufacture of high quality wines is evident:
Considering the important role of alcohol level, we also compared it with other features. When compared to volatile.acidity it was clear that low volatile.acidity and high alcohol level are very important to the wine classification as high quality. However, when alcohol was plotted with citric.acid, no clear relationship was observed.
I was surprised that there was no clear relationship between alcohol and citric.acid.
## red_wine$rating: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
## --------------------------------------------------------
## red_wine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## red_wine$rating: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
This plot is interesting because the boxplots show that the higher the percentage of alcohol the higher the quality of wine. The median alcohol level for high quality wine is 11.60 and the mean is 11.52. For the low quality wines, the 3rd Quartile was 11.00.
In this plot it is possible to observe the importance of volatile.acidity and citric.acid to obtain high quality wines. Most of the high quality wines (yellow points) presented high citric.acid concentration and low volatile.acidity concentration, whereas the low quality wines (violet points) presented low citric.acid concentration and high volatile.acidity value.
Results similar to those presented in the previous graphs can be observed here when we compare the level of alcohol with the volatile.acidity. A high volatile.acidity concentration is associated with low quality wines, while high alcohol content is associated with high quality wines.
The dataset analyzed contains 1,599 observations of different red wines. It is composed of 12 features: 11 chemical properties, which were considered in the analysis, and the score given by the experts, named quality.
The quality score ranges from 0 to 10. Given the summary of this feature, we observed that most of the instances are classified as 5 or 6, and only a few were rated 3 or 4, or 7 or 8. Based on that, the data was grouped into 3 categories: low (quality score less than 5), average (quality score of 5 or 6), and high (quality score of 7 or higher).
Based on an initial analysis, volatile.acidity, citric.acid, total.sulfur.dioxide, pH, and the percent alcohol of the wine were the features considered to support the investigation, since they are the features that contribute most to the smell and taste of wine.
Based on the plots produced, it was possible to observe that not all the features presented a definitive role in the classification of the wines. Volatile.acidity, citric.acid, and alcohol level are the ones that stood out the most.
Considering the process itself, it was very important to note that even though the dataset does not contain many features, not all of them are representative for the classification task. In addition, this whole process of exploring the data through graphics is laborious but can save us a lot of time during modeling.